1 Introduction

With platforms offering cheap flights and accommodation, travelling around Europe has become common among students. Sharing economy services such as Airbnb have made it easier to find a spare room rented out by a private host. Profits of the platform have skyrocketed in recent years and a typical UK host now earns around 3,000 pounds a year (Cox, 2017). As those earnings are ultimately paid by the user, understanding the pricing of Airbnb offers becomes crucial.

If you are on a tight budget, you want to get the best value for your money. At the same time, you have certain expectations about the type of flat, its location and the offered amenities. This paper's aim is to explore the various factors influencing the price and to set up a regression model that explains the pricing on Airbnb, so that you can find the right price for the right flat.

2 Description of the dataset

The dataset of this paper covers all AirBnB offerings in London as per the 4th and 5th of March 2017. It contains 53,904 observations for 95 different variables. Its source is the website “Inside AirBnB - Adding data to the debate” (Cox, 2017). This is an independent and non-commercial project aiming to examine the effect of AirBnB activities on urban development.

To allow this investigation to be more focused, the dataset was narrowed down. Only private rooms with at least three valid ratings were included. The resulting dataset has 6,495 observations for 78 variables.

2.1 Price

Table 1: Descriptives of the Price (GBP per night)
Min   Q1   Median   Mean   Q3   Max
8     35   45       50     59   590

A room in London costs on average 50 GBP per night. The summary statistics show that 75% of all AirBnBs are priced at £59 per night or less. However, there are some severe outliers that range up to a maximum of £590.

This raises concerns about the normality of the price distribution. In fact, the left panel of Figure 1 shows that the distribution is strongly right-skewed rather than normal. To bring it closer to a normal distribution, the price is transformed with the natural logarithm.
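The effect of the transformation can be illustrated with the summary statistics from Table 1 alone. The following Python sketch is only an illustration and not the code used for the analysis (which appears to have been carried out in R). The helper `tail_ratio` is a hypothetical name introduced here; it measures how far the maximum lies above the median in units of the interquartile range.

```python
import math

# Summary statistics of the nightly price in GBP, copied from Table 1.
price_summary = {"Min": 8, "Q1": 35, "Median": 45, "Mean": 50, "Q3": 59, "Max": 590}

# Natural logarithm of each statistic. Note that ln(mean) is not the mean of
# ln(price); it is listed only to show the change of scale.
log_summary = {name: math.log(value) for name, value in price_summary.items()}


def tail_ratio(summary):
    """Distance of the maximum from the median, in interquartile ranges."""
    iqr = summary["Q3"] - summary["Q1"]
    return (summary["Max"] - summary["Median"]) / iqr


if __name__ == "__main__":
    for name in price_summary:
        print(f"{name:<7}{price_summary[name]:>5}{log_summary[name]:>8.2f}")
    print(f"tail ratio, raw scale: {tail_ratio(price_summary):.1f}")
    print(f"tail ratio, log scale: {tail_ratio(log_summary):.1f}")
```

On the raw scale the maximum lies roughly 23 interquartile ranges above the median; on the log scale this shrinks to roughly 5, which is why the long right tail matters far less after the transformation.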

Figure 1: Density of Price and ln(Price)


2.2 Rent

## 
##  Pearson's product-moment correlation
## 
## data:  data_short$mean_rent and data_short$price_log
## t = 46.089, df = 7018, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4638676 0.4997879
## sample estimates:
##       cor 
## 0.4820303

With London being one of the most expensive cities to live in, rent is a major cost of being a host on AirBnB. Rents are also an interesting indicator of the attractiveness of a neighbourhood. Therefore, the impact of the underlying rent on the AirBnB price has to be accounted for. The initial dataset holds no information on the regular rent at the location of an AirBnB. Fortunately, a website called “Find Properly” (Lokku Ltd., 2017) uses data from Zoopla to provide rents and selling prices for each London region, broken down by post code. The average weekly rent for 1-bed properties is merged with the AirBnB data set by matching on the outward code, i.e. the first part of the post code.
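The matching on the outward code can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used for the paper: the field name `zipcode`, the helper names and the rent figures are all hypothetical. It relies on the fact that a full UK post code always ends in a three-character inward code.

```python
def outward_code(postcode):
    """Return the outward code (e.g. 'SW1A') of a full or partial UK post code."""
    compact = "".join(postcode.split()).upper()
    # A full post code has 5 to 7 characters and ends in a 3-character inward
    # code; anything shorter is treated as an outward code already. Partially
    # entered post codes would need extra cleaning that is not shown here.
    if len(compact) >= 5:
        return compact[:-3]
    return compact


def attach_mean_rent(listings, rent_by_outward):
    """Add 'mean_rent' to each listing dict; None when the code is unknown."""
    merged = []
    for listing in listings:
        code = outward_code(listing["zipcode"])
        merged.append({**listing, "mean_rent": rent_by_outward.get(code)})
    return merged


if __name__ == "__main__":
    # Hypothetical weekly rents and listings, for illustration only.
    rents = {"SW1A": 550.0, "E1": 400.0}
    listings = [{"id": 1, "zipcode": "sw1a 1aa"}, {"id": 2, "zipcode": "E1"}]
    for row in attach_mean_rent(listings, rents):
        print(row)
```

Listings whose outward code has no rent entry keep `mean_rent` as `None`, so they can be inspected or dropped explicitly rather than silently mismatched.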

Mapping the mean rent and the logarithmically transformed AirBnB price geographically reveals the positive correlation (+0.48) between the two variables. Nevertheless, it also becomes clear that there is more to an AirBnB price than just the average rent in the particular neighbourhood.
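The figures in the correlation test output above can be reproduced from the correlation coefficient and the degrees of freedom alone. The Python sketch below is an illustration with hypothetical function names; it uses the standard t statistic for a Pearson correlation and a confidence interval based on the Fisher z-transformation, which is assumed here to be the method behind the reported interval.

```python
import math


def correlation_t_statistic(r, df):
    """t statistic for testing that a Pearson correlation equals zero."""
    return r * math.sqrt(df / (1.0 - r * r))


def fisher_confidence_interval(r, n, z_crit=1.959964):
    """Approximate 95% interval for a correlation via the Fisher z-transform."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


if __name__ == "__main__":
    r, df = 0.4820303, 7018  # taken from the test output
    n = df + 2               # a correlation test has n - 2 degrees of freedom
    print(round(correlation_t_statistic(r, df), 3))
    low, high = fisher_confidence_interval(r, n)
    print(round(low, 4), round(high, 4))
```

With a sample of this size even a moderate correlation produces a very large t statistic, which is why the p-value is reported as smaller than 2.2e-16.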

Figure 2: Mapping Rent Prices vs. AirBnB Prices


2.3 Location

When choosing an AirBnB in London, many guests prefer to stay close to the city centre. Distance is defined as the distance to the touristic city centre, Piccadilly Circus. It was calculated using the Haversine formula (Reid, 2011) and the geographic coordinates of Piccadilly Circus (Longitude: -0.133869, Latitude: 51.510067) (Latlong.net, 2017). The correlation between distance and the logarithmic AirBnB price is negative and weak (-0.39): the closer the property is to the city centre, the higher the price. Looking at the individual distance bins, most high-end outliers are located close to Piccadilly Circus, and the price range narrows the further the flat is from the city centre.
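The distance calculation can be sketched in a few lines of Python. This is an illustration rather than the original implementation; the Earth radius of 6371 km and the function names are assumptions of the sketch, while the coordinates are the ones quoted in the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

# Coordinates of Piccadilly Circus as given in the text.
PICCADILLY_LAT = 51.510067
PICCADILLY_LON = -0.133869


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing the square root above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_to_centre_km(lat, lon):
    """Distance of a listing from Piccadilly Circus in kilometres."""
    return haversine_km(lat, lon, PICCADILLY_LAT, PICCADILLY_LON)


if __name__ == "__main__":
    # Hypothetical listing location, for illustration only.
    print(round(distance_to_centre_km(51.53, -0.10), 2))
```

Over the distances found within London the result differs only marginally from more elaborate ellipsoidal formulas, so the Haversine approximation is adequate for this purpose.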

Figure 3: Mapping Rent Prices vs. AirBnB Prices


2.4 Reviews

Reviews could be a useful indicator of various characteristics of the room advertised. In addition to the written reviews, guests can give their hosts star ratings on the following parameters (Airbnb Inc., 2017): overall experience, accuracy, cleanliness, communication, check-in, location and value. Most of those are self-explanatory; accuracy represents the extent to which the online listing represents reality, and value is a subjective measure of whether the room was worth the price paid.

The guest ratings are translated into a score out of 10 for the individual categories, and a score out of 100 for the overall score. The mean value for many categories is 9 or 10. Such high scores are frequently seen when feedback from users is collected. For example, Uber considers removing drivers rated on average less than 4.6 stars out of 5 (Insider, 2015).

Since the overall score is submitted independently, rather than calculated from the category scores, it is interesting to see which categories affect the user’s overall rating the most. All the subcategory ratings have at least a moderate, positive relation to the overall score. The correlations between the overall score and value, check-in and accuracy are the strongest, suggesting that those categories matter most for the guest’s overall satisfaction. In general, the relation between the different rating scores and price is very weak, suggesting that these indicators will add little to the goodness of fit. Location, however, has a weak positive correlation with the logarithmic price (0.30), making it an interesting indicator of the security, comfort and attractiveness of the neighbourhood.

Table 3: Descriptives of Rating
Name            Minimum   Maximum   Mean   Correlation to Overall Score   Correlation to Price
Accuracy        2         10        9      0.77                           0.09
Check In        2         10        10     0.78                           0.14
Cleanliness     2         10        9      0.67                           0.10
Communication   4         10        10     0.68                           0.10
Location        3         10        9      0.54                           0.30
Value           2         10        9      0.79                           0.04
Overall         20        100       92     1.00                           0.13

2.5 Property Characteristics

2.5.1 Accommodates and Beds

Table 4: Descriptives of Capacity (correlation with ln(price))
Variable       P-Value     Conf-Int. Low   Estimate   Conf-Int. High
Accommodates   6.68e-187   0.32            0.34       0.36
Beds           1.65e-71    0.19            0.21       0.23

The variables accommodates (how many people can stay in the property) and beds (the number of beds in the property) give an indication of the overall capacity of the AirBnB. Both variables have a correlation with the room price that is significantly different from zero. However, both correlations are weak, suggesting that even though price rises with capacity, it rises slowly.

2.5.2 Amenities

Figure 4: Percentage of Amenities


AirBnB includes some general information on the property such as the room type, the number of people that can be accommodated or the number of bathrooms. On top of these characteristics, AirBnB contains information on a wide range of amenities for every flat. These range from the availability of Internet and a TV to a personal doorman or a pool. In order to analyse these, dummy variables for 53 different amenities, with 46 resulting in usable data, as well as a variable counting the total number of amenities were introduced.
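The construction of the amenity dummies and the amenity count can be sketched as follows. The Python code is an illustration, not the original implementation. It assumes that the amenities of a listing arrive as a single string of the form `{TV,"Lock on Bedroom Door",Kitchen}`; both this format and the function names are assumptions of the sketch.

```python
import csv


def parse_amenities(raw):
    """Split a string such as '{TV,"Cable TV",Kitchen}' into a set of names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return set()
    # csv.reader handles quoted names that may themselves contain commas.
    return {item.strip() for item in next(csv.reader([inner]))}


def amenity_dummies(raw, selected):
    """Return 0/1 dummies for the selected amenities plus a total count."""
    present = parse_amenities(raw)
    row = {name: int(name in present) for name in selected}
    row["amenity_count"] = len(present)
    return row


if __name__ == "__main__":
    selected = ["TV", "Dryer", "Lock on Bedroom Door"]
    example = '{TV,"Lock on Bedroom Door",Kitchen,"Wireless Internet"}'
    print(amenity_dummies(example, selected))
```

Keeping the total count alongside the individual dummies allows a model to distinguish between well-equipped rooms in general and the effect of one specific amenity.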

To reduce the count, seven amenities were selected, most of which are associated with a significant difference in price. Priority was given to home essentials such as access to a kitchen, a lock to secure the personal space, a TV, a dryer and a washer, to facilities like an elevator, and to the family-friendliness of the room. The presence of a TV, an elevator, a dryer or a washer, as well as the family-friendliness of a room, tends to have a positive impact on the price; this is especially true for the TV and the dryer. Interestingly, rooms with a lock on the bedroom door are priced slightly lower, while the difference for kitchen access is small and not statistically significant (p = 0.52). Perhaps a lock on the bedroom door is more common in less safe locations.

Table 5: Descriptives of Selected Amenities
Amenity                  P-Value   Mean ln(price) With   Mean ln(price) Without   Difference in ln(price)
Washer                   0.03      3.83                  3.80                     0.03
TV                       0.00      3.89                  3.74                     0.15
Family / Kid-Friendly    0.00      3.87                  3.79                     0.08
Dryer                    0.00      3.92                  3.77                     0.15
Kitchen                  0.52      3.82                  3.83                     -0.01
Elevator in Building     0.00      3.89                  3.79                     0.10
Lock on Bedroom Door     0.00      3.79                  3.83                     -0.04

2.6 Attributes of the ad

Usually, a guest needs to submit a booking request and gets to stay in the property only if the host approves that request. To attract more customers, some hosts allow instant booking of their properties, which is similar to booking a hotel - the user just books the property straight away. In the dataset, TRUE means guests can book the desired property instantly, while FALSE means they have to get approval from the host first.

In addition to instant booking, hosts can choose their own cancellation policy, which determines whether and how guests are refunded. There are several cancellation policies from which hosts can choose, including flexible, moderate, strict and super strict. Under the flexible policy, guests may get a full refund if the reservation is cancelled within a limited period, typically up to 24 hours before check-in. Under the moderate policy, fees are fully refundable only if the booking is cancelled further in advance. Under the strict policy, only 50% of fees may be refunded if the booking is cancelled more than one week before check-in (Airbnb Inc., 2017). While the difference in mean price between rooms with and without instant booking is insignificant, the scale version of the cancellation policy is significantly but weakly correlated with the price of the room.

Table 6: Descriptives of Ad Properties
Attribute             P-Value
Instant Bookable      8.435e-01
Cancellation Policy   1.513e-08

3 Regression model

Table 7: Regression Results
Dependent variable: ln(Price)
Standard errors in parentheses

                                 (1)                          (2)
Mean Rent                        0.001*** (0.0001)            0.001*** (0.0001)
Distance                         -0.013*** (0.001)            -0.013*** (0.001)
Review Score - Rating            0.007*** (0.001)
Review Score - Accuracy          -0.016** (0.008)
Review Score - Check-In          0.014* (0.008)
Review Score - Cleanliness       0.040*** (0.006)             0.037*** (0.004)
Review Score - Communication     0.001 (0.009)
Review Score - Location          0.080*** (0.006)             0.072*** (0.006)
Review Score - Value             -0.083*** (0.008)
Accommodates                     0.160*** (0.006)             0.151*** (0.005)
Number of Beds                   -0.023** (0.010)
Amenity - Dryer                  0.072*** (0.008)             0.062*** (0.008)
Amenity - Elevator               0.045*** (0.008)             0.043*** (0.008)
Amenity - Family friendly        0.007 (0.008)
Amenity - Lock on Bedroom Door   -0.045*** (0.010)            -0.049*** (0.010)
Amenity - TV                     0.116*** (0.008)             0.116*** (0.008)
Amenity - Washer                 -0.044*** (0.010)
Instant bookable - FALSE         0.004 (0.010)
Cancellation Policy - Moderate   -0.008 (0.009)
Constant                         2.111*** (0.070)             2.003*** (0.054)

Observations                     7,020                        7,020
R2                               0.433                        0.420
Adjusted R2                      0.432                        0.420
Residual Std. Error              0.304 (df = 7000)            0.307 (df = 7010)
F Statistic                      281.780*** (df = 19; 7000)   565.119*** (df = 9; 7010)
Note: *p<0.1; **p<0.05; ***p<0.01

3.1 Interpretation

As our dependent variable was transformed to its logarithm, a log-linear regression model is used to explain the effect of the independent variables on the dependent variable. Comparing the two versions of the model, it becomes clear that some variables are insignificant, some have a multicollinearity problem and the review score for value is likely to have an endogeneity problem. Additionally, some of the amenities had a negative impact on the price, which is counterintuitive and contradicts the results of the t-test. As the effects are small and likely to be caused by random noise, such variables are excluded:

\[\begin{gather*} ln(price) = \beta_0 + \beta_1(mean\_rent) + \beta_2(distance) + \beta_3(review\_scores\_cleanliness) + \\ \beta_4(review\_scores\_location) + \beta_5(accomodates) + \beta_6(dryer) + \\ \beta_7(elevator) + \beta_8(lock) + \beta_9(TV) + u \end{gather*}\]

The presented regression model explains 42 percent of the variation in the dependent variable. The residual standard error of 0.307 is measured on the log scale; it corresponds to a multiplicative factor of about exp(0.307) ≈ 1.36, so a deviation of one standard error means the actual price is about 36 percent above (or about 26 percent below) the prediction, rather than a fixed amount in GBP. The F-statistic is highly significant, so the model explains the price far better than an intercept-only model. The intercept corresponds to a price of 7.41 GBP. However, no apartment has a rent of zero or accommodates no one, so the intercept has to be interpreted carefully. Each of the other coefficients, multiplied by 100, approximates the percentage change in price when the explanatory variable increases by one unit, holding all other independent variables constant. For example, for every additional person a room can accommodate, the price rises by roughly 15 percent. Capacity, the review score for location as an indicator of the attractiveness of the neighbourhood, and the presence of a TV have the largest positive effects on the price of a room. A lock on the bedroom door has a negative effect on the price. The effect of distance on room price appears small, but putting it into context reveals that for every kilometre further from Piccadilly Circus the price shrinks by 1.3 percent. A kilometre is only about a 12-minute walk.
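Reading a coefficient of a log-linear model as a percentage is an approximation that works well for small coefficients. The exact effect of a one-unit change is exp(β) - 1. The Python sketch below illustrates the difference using coefficients reported in Table 7; the function names are hypothetical, and treating 2.003 as the constant of model (2) is an assumption that is consistent with the 7.41 GBP intercept quoted above.

```python
import math


def percent_change(beta, delta=1.0):
    """Exact percentage change in price for a delta-unit change in x."""
    return (math.exp(beta * delta) - 1.0) * 100.0


def baseline_price(intercept):
    """Price implied by the intercept when all regressors are zero."""
    return math.exp(intercept)


# Selected coefficients of model (2), copied from Table 7.
coefficients = {
    "Accomodates": 0.151,
    "Distance": -0.013,
    "Amenity - TV": 0.116,
    "Amenity - Lock on Bedroom Door": -0.049,
}

if __name__ == "__main__":
    for name, beta in coefficients.items():
        approx = beta * 100.0
        exact = percent_change(beta)
        print(f"{name:<32}{approx:>8.1f}%{exact:>8.1f}%")
    # Assumed constant of model (2); see the lead-in.
    print(f"baseline price: {baseline_price(2.003):.2f} GBP")
```

For the distance coefficient the approximation and the exact value agree to within a hundredth of a percentage point, while for capacity the exact effect (about 16.3 percent) is noticeably larger than the approximate 15 percent.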

3.2 Fitting the model

Table 8: Residuals
Name                       Mean ln(Price)   Mean Price (GBP)
Data with high residuals   4.37             85.05
Data with low residuals    3.73             44.14

3.2.1 Residuals

A short exploration of the residuals shows that observations with large residuals have a higher mean price than the other observations. As the scatter plot illustrates, the model predicts more expensive rooms less accurately. The factors chosen do not fully explain the difference in price. The Durbin-Watson statistic of 1.88 is close to 2, which indicates that the autocorrelation of the errors is small (0.06), although the reported p-value of 0 means that it is still statistically distinguishable from zero.

Figure 5: Residuals and ln(price)


Table 10: Results Durbin Watson Test
Autocorrelation   D-W Statistic   p-value
0.06              1.88            0
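The relation between the two figures in the Durbin-Watson table can be made explicit. The statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and it is approximately 2(1 - ρ), where ρ is the lag-1 autocorrelation. The Python sketch below is an illustration with hypothetical function names and toy residuals; it is not the test routine used for the paper.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a sequence of regression residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation, assuming the residuals have mean zero."""
    numerator = sum(a * b for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def approximate_dw(rho):
    """Large-sample approximation d = 2 * (1 - rho)."""
    return 2.0 * (1.0 - rho)


if __name__ == "__main__":
    toy_residuals = [0.3, -0.1, 0.2, -0.4, 0.1, -0.1]  # made-up values
    print(round(durbin_watson(toy_residuals), 3))
    print(round(lag1_autocorrelation(toy_residuals), 3))
    # Autocorrelation reported in the table above.
    print(round(approximate_dw(0.06), 2))
```

A value of 2 corresponds to no autocorrelation, values towards 0 to positive and values towards 4 to negative autocorrelation. The reported autocorrelation of 0.06 implies a statistic of about 1.88, which matches the table.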

3.2.2 Sensitivity to outliers

As in any regression model based on ordinary least squares, the coefficients in our model are affected by outliers. Some of the properties in our data set cost more than £400 per night, while most of them cost less than £100. These outliers may disproportionately affect our coefficients, making them less accurate for the remaining observations. However, as excluding observations from the regression without a substantive reason would be bad practice, the outliers are left untreated.

3.2.3 Multicollinearity

Most highly correlated variables were already excluded from the regression, as such relations inflate the variance of the coefficient estimates. A VIF of four implies that the variance of an estimator is four times higher than it would be if the independent variables were uncorrelated. Usually, a VIF greater than 3 is considered critical to the model results. None of the variables used reaches that threshold.

Table 9: Results Multicollinearity Test
Variable                     VIF
Mean Rent                    1.76
Distance                     1.68
Review Score - Cleanliness   1.28
Review Score - Location      1.37
Accommodates                 1.03
Amenity - Dryer              1.05
Amenity - Elevator           1.02
Amenity - Lock on Bedroom    1.03
Amenity - TV                 1.07
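The VIF of a regressor is defined as 1 / (1 - R²), where R² comes from regressing that regressor on all the other explanatory variables. The following Python sketch, with hypothetical function names, converts between the two quantities so that the values in the table above can be read as shares of explained variance. It illustrates the definition only and does not re-estimate anything.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif):
    """Auxiliary-regression R^2 implied by a variance inflation factor."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# VIF values copied from the multicollinearity table above.
reported_vif = {
    "Mean Rent": 1.76,
    "Distance": 1.68,
    "Review Score - Cleanliness": 1.28,
    "Review Score - Location": 1.37,
    "Accomodates": 1.03,
    "Amenity - Dryer": 1.05,
    "Amenity - Elevator": 1.02,
    "Amenity - Lock on Bedroom": 1.03,
    "Amenity - TV": 1.07,
}

if __name__ == "__main__":
    for name, vif in reported_vif.items():
        print(f"{name:<28}{vif:>6.2f}{r_squared_from_vif(vif):>8.2f}")
```

The largest value, 1.76 for mean rent, means that about 43 percent of the variation in mean rent is explained by the other regressors, which is plausible given that rent and distance to the centre are related.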

3.2.4 Omitted variable bias

The price of an AirBnB is affected by a large number of factors. The presented model includes some of them, but it was not feasible or possible to include data concerning every single possible determinant. As a result, the model likely suffers from omitted variable bias. It under- or overestimates the effect of some of the existing factors to compensate for the missing information, making the model less reliable.

3.2.5 Lack of clustering

By putting all properties into one model, we ignore the fact that there might be different profiles of properties and for each profile, different characteristics might be relatively more important. Perhaps there is a set of properties that are popular with students coming to London for graduate job interviews, who would see location close to the financial centers and low price as important factors. And, perhaps, different types of properties are popular with middle-aged tourists - then the proximity to the popular sights and the level of comfort provided might matter more. If we divided our properties into clusters which share similar characteristics, and then ran a regression analysis for each cluster, we might get a more accurate model for each cluster.

3.2.6 Limitations

The dataset lacks several important variables, such as the size of the room, the proximity of the flat to a tube station, the age of the flat, and the quality of the equipment and furniture. Additionally, the attractiveness of the room and of the building it is in could not be quantified. As this attractiveness differs across buildings and sometimes even within a building, the model cannot predict the price of an apartment that exceeds the expectations set by the explanatory variables used in the regression.

Figure 6: Predicted and Real Value


4 Conclusion

Although the presented model has obvious limitations regarding factors that could not be quantified, it has direct implications for finding a reasonably priced apartment. Many attributes that matter to someone searching for a room, like WiFi and a properly equipped kitchen, have small effects on the room price, as they are present in most London apartments. A traveller can therefore expect them to be available. Luxury amenities like an elevator, a TV and a dryer come at a cost. Depending on the standards of the guest, these can be added if the budget allows. It is also good advice to check apartments in less attractive neighbourhoods to save money. Regarding cleanliness, a well-maintained room will cost more. Looking at these different attributes of an AirBnB ad, the user is able to judge whether the price of an apartment is actually fair, which was the aim of this report. As especially high prices could not be explained by the model, a prediction is likely to return a base price rather than the price of a highly attractive room in a good apartment in a nice building. To extend the scope of this paper, future work could cluster the properties into profiles and add the variables that are currently missing, such as the size of the room and the proximity to a tube station.

Bibliography

Airbnb Inc. (2017) How do star ratings work. [Online]. Available from: https://de.airbnb.com/help/article/1257/how-do-star-ratings-work.

Cox, M. (2017) Inside airbnb - adding data to the debate. [Online]. Available from: http://data.insideairbnb.com/united-kingdom/england/london/2017-03-04/data/listings.csv.gz.

Latlong.net (2017) Get latitude and longitude. [Online]. Available from: https://www.latlong.net.

Lokku Ltd. (2017) London house prices by postcode. [Online]. Available from: https://www.findproperly.co.uk/london/postcode/#.WdvonHeZNn4.

Reid, M. (2011) Haversine formula. [Online]. Available from: http://wordpress.mrreid.org/2011/12/20/haversine-formula/.

Furthermore, for plotting our observations on a ggmap, we consulted the following sources:

Irawan, D.E. (2014) How to convert lat-long coordinates to utm. [Online]. Available from: https://rpubs.com/dasaptaerwin/19879.

Lovelace, R. & Cheshire, J. (2014) Introduction to visualising spatial data in R. National Centre for Research Methods Working Papers. [Online] 14 (03). Available from: https://github.com/Robinlovelace/Creating-maps-in-R.

The header photo was downloaded from Pexels and is licence free. Available from: https://www.pexels.com/photo/architecture-buildings-business-capital-417382/

Imperial College Business School